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Abstract 

Trissolcus basalis (Wollaston) is a minute parasitic wasp that develops in the eggs of stink bugs. Over the 
past 30 years, 77. basalis has become a model organism for studying host finding, patch defense behavior, 
and chemical ecology. As an entry point to better understand the molecular basis of these factors, in ad- 
dition to filling a critical gap in the genomic resources available for parasitic Hymenoptera, we sequenced 
and assembled the genome of 77. basalis using short (454, Illumina) and long read (Oxford Nanopore) 
sequencing technologies. The three sequencing methods produced 32 million reads (4.10 Gb; 27.9x), 
which were assembled into 7,586 scaffolds. The 147 Mb (N50: 42.8 kb) assembly contains complete se- 
quences for 93.1% of the insect BUSCO dataset, and an extensive annotation protocol resulted in 14,158 
protein-coding gene models, 12,197 (86%) of which had a blast hit in GenBank. Repetitive elements 
comprised 13.8% of the genome, and a phylogenomic analysis recovered 77. basalis as sister to Chalci- 
doidea, a result in line with other studies. We identified 174 rapidly evolving gene families in 77. basalis, 
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including olfactory receptors and pheromone/general odorant binding proteins. These genetic elements 
are an obligatory portion of the parasitoid-host relationship, and the draft genome of 77. basalis has and 


will continue to be useful in elucidating these relationships at finer resolution. 
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Introduction 


Trissolcus basalis (Wollaston) (Hymenoptera: Scelionidae) is a minute, solitary parasi- 
toid of stink bug eggs (Hemiptera: Pentatomoidea), principally the cosmopolitan pest 
Nezara viridula (L.) (Pentatomidae). ‘This parasitoid is found primarily in tropical and 
subtropical regions, where it has been used effectively in the biological control of its 
host (Davis 1964; Clarke 1990; Corréa-Ferreira and Moscardi 1996). Given the eco- 
nomic importance of its host, considerable effort has been expended to elucidate how 
female 77. basalis locate N. viridula eggs in the narrow window of time during which 
they are susceptible to attack (Bin et al. 1993; Colazza et al. 1999, 2004; Salerno et 
al. 2006; Laumann et al. 2009). Host location and acceptance by female wasps are 
known to be mediated by chemical cues, some of which have been isolated and identi- 
fied (Mattiacci 1993; Colazza et al. 2004, 2007). This effort to sequence the genome 
of Tr. basalis was undertaken as a step in characterizing its repertoire of chemoreceptor 
proteins (Chen et al. 2021a) and to better understand the mechanisms of host finding 
in platygastroid wasps and their evolutionary consequences. 


Methods 
Whole-genome sequencing 


454 Life sciences 


Sequencing followed the protocol of Mao et al. (2012). Briefly, DNA was extracted 
from 25 adult male 77 basalis from a colony maintained at the Universita di Perugia 
(Perugia, Italy). Sequencing was conducted at the University of Pennsylvania Perelman 
School of Medicine on a Roche/454 GS FLX sequencer using Titanium chemistry, 
which generated 5,080,113 reads (1,535,920,544 bp). 


Illumina 


To correct homopolymer errors in the 454 reads, an Illumina sequencing library was 
prepared from five female 77 basalis in the same culture. The DNA extract was pre- 
pared for Illumina sequencing using a Nextera DNA Sample Preparation Kit (Epicen- 
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tre Biotechnologies, Madison, Wisconsin, USA). Sequencing was conducted on an 
Illumina Genome Analyzer IIx (Illumina, San Diego, California, USA) at the Nucleic 
Acid Shared Resource (College of Medicine, The Ohio State University, Columbus, 
Ohio, USA). In total, 29,780,645 51-bp reads (1,518,812,895 bp) were generated. 


Oxford nanopore 


High molecular weight DNA was extracted from approximately 100 unsexed 77. basalis 
using a Gentra Puregene Tissue Kit (Qiagen, Hilden, Germany) following the man- 
ufacturer’s protocol. DNA quality was estimated using an Agilent Bioanalyzer. The 
DNA library was prepared using a Ligation Sequencing Kit 1D. Sequencing was per- 
formed on a R9.5 flow cell using an Oxford Nanopore MinION (Oxford Nanopore, 
Oxford, United Kingdom). The 48-hour MinION sequencing run generated 341,751 
reads (1,047,061,835 bp). All steps, excluding DNA extraction, were conducted at The 
Molecular and Cellular Imaging Center (MCIC; The Ohio State University, Wooster, 
Ohio, USA). 


Processing of sequencing reads 


Pyrosequencing reads are particularly susceptible to the accumulation of homopoly- 
mer errors (Huse et al. 2007). These were corrected using HECTOR (version 1.0.0; 
Wirawan et al. 2014). Adapter sequences were removed from nanopore reads with 
Porechop (version 0.2.3; https://github.com/rrwick/Porechop) and reads with internal 
adapter sequences were split into two reads. 


Genome assembly 


The 77. basalis genome was assembled following a hybrid approach that utilized short 
(454, Illumina) and long read (Oxford Nanopore) sequencing technologies. 454 and 
nanopore reads were assembled with SPAdes (version 3.11.1; Bankevich et al. 2012), 
with the initial assembly (assembled in 2010 by NFJ) treated as ‘trusted contigs’ and 
the ‘careful’ flag turned on to minimize misassemblies. The assembly was polished 
with single-end 51 bp Illumina reads for 4 iterations using Pilon (version 1.22; Walker 
et al. 2014). The polished assembly was then scaffolded with RNA-seq reads from 
pooled tissues of male and female 77. basalis using rascaf (version 20161129; Song et 
al. 2016; Chen et al. 2021a) to produce the final assembly. 


Assembly statistics and quality 


Genome statistics were calculated with QUAST (version 4.5; Gurevich et al. 2013). 
Genome assembly completeness was assessed with BUSCO (version 4.0.6; Sim4o et 
al. 2015) using the Metazoa, Arthropoda, Insecta, Endopterygota, and Hymenop- 
tera datasets. 
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Genome annotation 


The 77. basalis genome was annotated following the protocol of Daren Card (Depart- 
ment of Organismic & Evolutionary Biology, Harvard University), with modifications 


(https://gist.github.com/zjlahey/3c400c3039eef674e335d3d850ad595f). 


Repetitive elements 


Repetitive elements were identified and annotated with RepeatModeler (version open- 
2.0.1; Flynn et al. 2020) and RepeatMasker (version 4.1.0; Smit et al. 2014). First, a 
custom repeat library was generated for 77, basalis using RepeatModeler. This repeat 
library was then combined with a curated arthropod repeat library from RepBase (Bao 
et al. 2015), which was used to mask complex repetitive elements in the 77. basalis 
genome using RepeatMasker. 


Protein-coding genes 


Protein-coding genes were annotated in an iterative fashion with MAKER (version 
3.01.03; Campbell et al. 2014). MAKER utilizes external evidence in the form of 
protein and transcript sequences from other organisms to train ab-initio gene predic- 
tion software to annotate genes within a genome. In the first iteration, external evi- 
dence was supplied to MAKER as (1) TransDecoder-derived coding sequences (CDS) 
from each 77. basalis transcriptome assembly; (2) CDS from Telenomus remus Nixon 
(Huayan Chen, unpublished data); (3) TransDecoder-derived protein sequences from 
each 77. basalis transcriptome assembly; (4) all 170 arthropod proteomes in OrthoD- 
Bv10.1 (http://www.orthodb.org/); and (5) the UniProtKB/Swiss-Prot protein data- 
base (Bateman et al. 2020). Subsequent rounds utilized SNAP (version 2006-07-28; 
Korf 2004) and Augustus (version 3.3.3; Stanke and Waack 2003) to improve the gene 
models from the first iteration and identify new genes in the assembly. MAKER was 
run for three iterations, until the number of gene models and average length of each 
gene declined. Conserved domains within proteins of the final gene set were identified 
using InterProScan (version 5.46-81.0; Blum et al. 2020), and conserved functions 
were determined by performing a BLASTp (version 2.6.0) of the gene set against all 
metazoan proteins in the Swiss-Prot Uniprot database (Bateman et al. 2020). Finally, 
COGNATE (version 1.0; Wilbrandt et al. 2017) was employed to generate summary 
statistics of the annotated protein set (no. of exons/introns, avg. gene length, etc.). 


Non-coding RNAs 


We followed the protocol on Rfam (https://docs.rfam.org/en/latest/genome-annota- 
tion.html) to identify and annotate non-coding RNAs with Infernal (version 1.1.3; 
Nawrocki and Eddy 2013) and Rfam (version 13.0; Kalvari et al. 2018). Nuclear trans- 
fer RNAs were annotated with tRNAscan-SE (version 2.0.6; Lowe and Eddy 1997). 
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Gene family analysis 


Taxon sampling and protein datasets 


To estimate gene gains, losses, and rapidly evolving gene families within 77. basalis, 
we conducted a gene family analysis using the 7 basalis proteome and the protein 
sequences of six additional hymenopterans. Taxa were chosen based on the availabil- 
ity of hymenopteran proteomes and included three members of Proctotrupomorpha 
[Belonocnema kinseyi Weld (Cynipidae), Nasonia vitripennis (Walker) (Pteromalidae), 
and Trichogramma pretiosum Riley (Trichogrammatidae)]; one member of Ichneumo- 
noidea [Microplitis demolitor Wilkinson (Braconidae)]; one member of Orussoidea 
[Orussus abietinus (Scopoli) (Orussidae)]; and the turnip sawfly, Athalia rosae (L.) (Ten- 
thredinidae). Protein sequences of A. rosae, O. abietinus, M. demolitor, N. vitripennis, 
and T. pretiosum were downloaded from OrthoDB v10 (Kriventseva et al. 2019). The 
B. kinseyi proteome (then under the name B. treatae (Mayr) (Zhang et al. 2021)) was 
downloaded from NCBI. Redundant isoforms of multicopy genes in the B. kinseyi 
proteome were removed prior to analysis. Proteomes downloaded from OrthoDB did 
not require this step. 


Gene family identification and clustering 


Orthogroup inference was conducted with OrthoFinder (version 2.5.2; Emms and 
Kelly 2019) at default parameters (DIAMOND, MAFFT, FastTree). Due to com- 
putational limitations associated with using IQ-TREE at the tree inference step 
of OrthoFinder, we performed a separate phylogenetic analysis on the same 4,510 
orthologues (Species TreeAlignment.fa) identified during the initial run using IQ- 
TREE (version 2.1.2; Minh et al. 2020). The final step of OrthoFinder was then 
rerun with the species tree produced by IQ-TREE as input (orthofinder.py -ft 
RESULTS_DIR -s IQ-TREE_SPECIES_TREE). We then converted the species 
tree to a time-calibrated ultrametric tree using the OrthoFinder accessory script 
make_ultrametric.py, with the root node calibrated at 265 mya based on the esti- 
mated divergence time between Athalia Leach and Orussus Latreille in the Time- 
Tree database (Kumar et al. 2017). 


Gene family evolution 


Rates of gene gain and loss (i) were estimated with CAFE (version 4.2.1; Han et al. 
2013) using the orthogroup count data and ultrametric time-tree produced by Or- 
thoFinder as input. Prior to running CAFE, we modified the orthogroup count data 
file by removing gene family clusters where only a single species was present (Prost 
et al. 2019). This step reduced the number of gene families from 11,205 to 10,190. 
Finally, we accounted for possible deviation in the number of observed vs true gene 
family counts by estimating an error model (¢) to optimize the value of i. 
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Data availability 


This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank 
under the accession JAAMPD000000000. Raw DNA sequencing reads (454, Illumi- 
na, Nanopore) are available at the Sequence Read Archive by searching for BioProject 


Accession PRJNA49235. 


Results and discussion 


Genome sequencing and assembly statistics 


We assembled the genome of 77. basalis de novo using sequence data from second- 
and third-generation sequencing technologies. The combined read output from all 
sequencing platforms totaled 4.10 Gb (27.9x coverage). These reads were assembled 
into 7,568 scaffolds, totaling 147 Mb in length (34.7% GC content). The scaffold 
N50 was 42.8 kb, and the longest scaffold measured 349,262 kb. Given low read 
coverage, we were unable to estimate genome size in silico. However, the 77. basalis 
draft assembly size falls within the range of genome size estimates of other platygas- 
troids, which are typically between 200 and 400 Mb (data not shown), in addition 
to the average genome size range of other hymenopterans. We assessed genome as- 
sembly completeness with BUSCO, using the Insecta odbv10 database (N = 1,367) 
in genome mode and with the ‘long’ flag enabled to perform a more thorough 
search. We recovered 93.1% complete, 1.8% duplicated, 1.4% fragmented, and 
3.7% missing Insecta BUSCOs in the 77 basalis genome. These values compare 
favorably with other parasitoid Hymenoptera with more contiguous genome as- 


semblies (Fig. 1). 


Genome annotation 


Repetitive elements 


RepeatMasker annotated 13.8% of the Tr basalis genome as composed of repeats, 
approximately half of which were unclassified repeats (7.1%). The most abundant clas- 
sified repetitive elements were various LINE and LTR retroelements (3.0%); DNA 
transposons (1.3%); and simple repeats (1.5%). The repeat landscape of 7 basalis 
shows a relatively uniform distribution of repeat classes, with a gradual decline in the 
proportion of LTR retroelements and an increase in the proportion of DNA transpo- 
sons (Fig. 1). Subtle increases in the proportion of LINEs and rolling circle transpo- 
sons are evident between Kimura distances 0.04 and 0.12. SINEs contribute little to 
the overall repeat content in 77 basalis, and other Hymenoptera, in general (Petersen 
et al. 2019). A complete list of the repeats found in the 77 basalis genome is in the 
Suppl. material 1. 
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Figure |. Morphological and genomic traits of Trissolcus basalis A head of female 77. basalis reared from 
BMSB eggs in Tuscaloosa, Alabama, USA (FSCA 00090269 B repeat landscape plot of different TE classes 
within the 77. basalis genome. Nucleotide sequence divergence in each TE copy was calculated as the 
Kimura distance between the annotated TE copies in the genome and the consensus sequence of each 
TE family C ultrametric timetree depicting the position of 77. basalis relative to six other hymenopterans 
inferred from a phylogenetic analysis of 4,510 single-copy protein-coding genes identified by OrthoFinder. 
Numbers above branches (left to right, separated by forward slashes) indicate gene family expansions, gene 
family contractions, and the number of rapidly evolving gene families in each lineage. Each branch received 
100% SH-aLRT and UFBoot2 support values D genome assembly completeness comparison based on 
the proportion of BUSCOs recovered in each genome using the Insecta odbv10 dataset (N = 1367). Ab- 
breviations: BMSB, brown marmorated stink bug; C, complete; D, duplicated; DNA, DNA transposon; 
F, fragmented; LINE, long interspersed nuclear element; LTR, long terminal repeat; M, missing; mya, 
million years ago; RC, rolling circle transposon; S, single-copy; SINE, short interspersed nuclear element; 
TE, transposable element; Aros, Athalia rosae; Bkin, Belonocnema kinseyi; Mdem, Microplitis demolitor; 
Nvit, Nasonia vitripennis; Oabi, Orussus abietinus; Thal, Trissolcus basalis; Vpre, Trichogramma pretiosum. 
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Protein-coding genes 


The MAKER genome annotation pipeline resulted in 14,158 protein-coding gene 
models. Approximately 95% (13,507) of the 14,158 gene models have an annota- 
tion edit distance (AED) score of less than 0.5, and 70% (9,915) contain at least one 
recognizable InterPro domain. AED is a quality control metric that explains how well 
the gene annotations produced by MAKER match external evidence (i.e., proteomes 
from other species). Ihe AED values and proportion of gene annotations with a recog- 
nizable InterPro domain for the 77. basalis genome are indicative of a well-annotated 
assembly (Holt & Yandell, 2011). In addition, nearly half of the protein set (6,929 or 
48.9%) was assigned at least one gene ontology (GO) term. To determine how well 
our annotated protein set compares with external protein databases, we queried our 
protein annotations against those of the metazoan portion of the Swiss-Prot/UniProt 
database and all Hymenoptera protein sequences deposited in GenBank (last accessed 
March 17, 2021). A total of 9,303 (65%) and 12,197 (86%) of the annotated proteins 
in 77. basalis were supported by a best BLAST p hit in the Swiss-Prot/UniProt database 
and GenBank, respectively. A table of the most frequently recovered InterPro domains, 
GO terms, and Pfam entries associated with the 77 basalis protein set is available in 
the Suppl. material 1. 


Non-coding RNAs 


Seventy-five different RNA families were annotated in the 77 basalis genome. The 
top 5 most common families belong to the tRNA (RF00005), Histone3 (RF00032), 
5S_rRNA (RF00001), SSU_rRNA_eukarya (RFO1960), and LSU_rRNA_eukarya 
(RF02543) RNA sequence families. We also identified both conserved regions of the 
Sphinx long non-coding RNA gene, which plays a role in the regulation of male mating 
behavior in the fruit fly Drosophila melanogaster Meigen (Wang et al. 2002; Dai et al. 
2008). Within Hymenoptera, Sphinx has been reported from 15 taxa including three 
species of parasitoid in the pteromalid genus Nasonia Ashmead (Werren et al. 2010). 
Its role in regulating mating behavior in Hymenoptera is not known. Additional statis- 
tics of the ribosomal DNA within the 77 basalis genome are in the Suppl. material 1. 


Gene family analyses 


We compared the annotated proteome of 77. basalis with those of six other hymenopter- 
ans with well-annotated genomes. Orthogroup clustering performed with OrthoFind- 
er assigned 81,474 (93.4%) of the 87,222 protein sequences into 11,205 orthogroups. 
The number of orthogroups with all species present was 6,295 and 4,510 of these 
were identified as single-copy orthologues. Regarding 77. basalis, 81.8% (11,582) of its 
genes were assigned to an orthogroup, and 76.7% (8,599) of orthogroups contained 
Tr. basalis. The number of orthogroups specific to 77, basalis was 173, and the number 
of genes within these 173 species-specific orthogroups was 1,026 (7.2% of the 11,582 
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genes assigned to an orthogroup). The number of unassigned genes in 77. basalis was 
much higher than the taxa with which it was compared. Potential explanations for this 
discrepancy are (1) the fragmentary nature of the 77. basalis draft assembly leading to 
truncated protein models and (2) inaccurate gene annotations. Increasing genome con- 
tiguity using additional long-read sequencing technologies and chromosome confirma- 
tion capture would decrease the incidence of truncated protein models, and manual 
curation of the gene models would aid in the identification of false positives. 

The orthogroup count data and ultrametric timetree produced by OrthoFinder 
were used to estimate the rate of gene family evolution with CAFE. We estimated 
the rate of gene family evolution (gains and losses) in this group of Hymenoptera 
at 0.0008, after accounting for possible genome assembly/annotation error. This re- 
sult is in line with a recent multi-order gene family analysis that reported the rate of 
gene family gain and loss in 24 hymenopteran taxa at 0.0009 (Thomas et al. 2020), a 
gene turnover rate slower than Coleoptera (0.001), Diptera (0.001), and Lepidoptera 
(0.0014). In total, 638 gene families were identified as rapidly evolving among the 7 
hymenopterans included in this study. 

We identified 174 (99 expansions and 55 contractions) rapidly evolving gene fami- 
lies in Tr. basalis, with most (91) rapidly expanding families containing at least one 
member with an InterPro, PANTHER, or Pfam annotation (Suppl. material 1), and 
slightly fewer than half with at least one corresponding GO term (48). Notable examples 
of gene families undergoing rapid evolution in 77. basalis are three groups of olfactory 
receptors (contracting in OG0000089; expanding in OG0000163 and OG0000567), 
one group of 7-transmembrane chemoreceptors (expanding, OG0000365), and one 
group of pheromone/general odorant binding proteins (expanding, OG0009810). The 
chemoreceptor repertoire of 77, basalis was recently treated by Chen et al. (2021a) who 
employed sex- and tissue-specific transcriptome assemblies, in addition to the 77. basalis 
genome, to annotate its gustatory, olfactory, and ionotropic receptor genes. One family 
of proteins not treated by Chen et al. (2021a), yet integral in the recognition and deliv- 
ery of odorant molecules to their respective odorant receptors, are the odorant binding 
proteins (OBPs) (Pelosi and Maida 1995). OBPs are small, water-miscible polypeptides 
that solubilize and deliver volatile, hydrophobic compounds to the membrane of che- 
mosensory receptor neurons for further processing (Pelosi et al. 2018). Therefore, OBPs 
are the first component in a multistep process that begins with semiochemical binding 
and culminates in a behavioral response. We are only beginning to investigate the OBP 
repertoire in 77 basalis; however, given the quality of the 77 basalis draft genome, we 
have identified, annotated, and characterized 18 putative OBPs, and determined those 
that exhibit antennal-biased expression patterns (King et al. 2021). 


Author’s note 


While this manuscript was in preparation, Xu et al. (2021) published a highly contigu- 
ous, chromosome-scale genome assembly of Te. remus (Platygastroidea: Scelionidae), 


a telenomine egg parasitoid of the fall armyworm Spodoptera frugiperda (J. E. Smith) 
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(Lepidoptera: Noctuidae). 7elenomus Haliday occupies an important phylogenetic po- 
sition within the family Scelionidae as the sister taxon to Trissolcus Ashmead (Taekul 
et al. 2014; Chen et al. 2021b), and thus serves as an ideal candidate with which to 
compare the 77 basalis genome assembly reported here. A preliminary investigation 
into some commonly reported genome metrics corroborates several features that may 
be characteristic of telenomine genomes: (1) small genome size (< 150 Mb); (2) low 
repetitive element content; (3) approximately 15,000 protein-coding genes; and (4) 
similar rates of gene family evolution (Xu et al. 2021). These differences were discern- 
ible between the two genomes despite the stark contrast in assembly contiguity (i.e., 
11.9 Mb scaffold N50 for Te. remus; 42.8 kb scaffold N50 for Tr. basalis). This sug- 
gests that even highly fragmented genome assemblies can be of sufficient quality to 
infer genome-scale parameters accurately. We anticipate comparative genomic analyses 
between Te. remus and Tr. basalis will result in major discoveries related to genome 
evolution within Hymenoptera and the genomic factors implicated in host location, 
host acceptance, and the biological control potential of both species. 
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